431 Class 22

Thomas E. Love, Ph.D.

2024-11-14

Today’s Agenda

  • The here() package
  • Turning values like 77 and 99 into NA and vice versa
  • Selecting a multiple linear regression model for prediction of a quantitative outcome
  • Incorporating both single and multiple imputation to handle missing data
  • Our example: the msleep data from the ggplot2 package in the tidyverse

Today’s Packages

library(here)
library(naniar)
library(janitor)
library(broom)
library(gt)
library(mosaic)
library(car)
library(GGally)
library(mice)
library(xfun) 
library(easystats)
library(tidyverse)

theme_set(theme_bw())

The here() package

The here() package

https://here.r-lib.org/

The here package builds paths relative to the top-level directory of the current project. It displays that top-level directory on load, or any time you call here():

here()
[1] "D:/Teaching/431/2024-431/431-slides-2024"

The here() package

Jenny Bryan’s Ode to the here package

Jenny Bryan on Project-oriented workflow

Using R projects and the here package

How can you avoid setwd() at the top of every script?

  1. Organize each logical project into a folder on your computer.
  2. Make sure the top-level folder advertises itself as such. If you use RStudio and/or Git, those both leave characteristic files that get the job done.
  3. Use the here() function to build the path when you read or write a file. Create paths relative to the top-level directory.
  4. Whenever you work on this project, launch the R process from the project’s top-level directory.
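As a minimal sketch of step 3 (the "data" subfolder and file name below are hypothetical, not part of this project):

```r
library(here)

# Build a path anchored at the project root; the same call works
# no matter which subfolder the running script lives in.
path <- here("data", "class22.csv")   # hypothetical file

# dat <- readr::read_csv(path)        # then read relative to the root
```

Because the path is anchored at the project root rather than the working directory, the script still runs when the project is moved or shared.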

Creating / Replacing Missing Values

Turning values like 77 and 99 into NA

Suppose we have the following small data set, where 77 = “Refused” and 99 = “No Response” or some other term that we want to think of as “missing”.

var1 <- c(20, 22, 35, 19, 77, 99)
var2 <- c(1, 3, 4, 77, 6, 99)
var3 <- c("Yes", "No", 77, 99, "No", "Yes")
dat <- tibble(var1, var2, var3) |> mutate(var3 = factor(var3))

miss_var_summary(dat)
# A tibble: 3 × 3
  variable n_miss pct_miss
  <chr>     <int>    <num>
1 var1          0        0
2 var2          0        0
3 var3          0        0

How can we convince R the 77s and 99s are missing values?

Use replace_with_na() from naniar

dat1 <- dat |>
  replace_with_na(
    replace = list(var1 = c(77, 99), var2 = c(77, 99),
                   var3 = c(77, 99)))
miss_var_summary(dat1)
# A tibble: 3 × 3
  variable n_miss pct_miss
  <chr>     <int>    <num>
1 var1          2     33.3
2 var2          2     33.3
3 var3          2     33.3

More on replace_with_na() here

Replacing 77 and 99 with NA across all variables

dat2 <- dat |>
  replace_with_na_all(
    condition = ~.x %in% c(77, 99))
miss_var_summary(dat2)
# A tibble: 3 × 3
  variable n_miss pct_miss
  <chr>     <int>    <num>
1 var1          2     33.3
2 var2          2     33.3
3 var3          2     33.3

Other ways to extend replace_with_na() are described here
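For comparison, a base-R sketch of the same across-all-variables recode, using the toy data from above (no naniar required):

```r
var1 <- c(20, 22, 35, 19, 77, 99)
var2 <- c(1, 3, 4, 77, 6, 99)
var3 <- c("Yes", "No", "77", "99", "No", "Yes")
dat0 <- data.frame(var1, var2, var3)

# Replace any 77 or 99 with NA, column by column
dat0[] <- lapply(dat0, function(x) replace(x, x %in% c(77, 99), NA))

colSums(is.na(dat0))  # 2 missing values in each column
```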

What if we have the opposite issue?

The replace_na() function from the tidyr package (details here) replaces an NA value with a specified value.

In that sense, it is the complement of the replace_with_na() function.

Demo: Replacing NA with a value

df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 <NA> 
3    NA b    
df1 <- df |> replace_na(list(x = 0, y = "unknown"))
df1
# A tibble: 3 × 2
      x y      
  <dbl> <chr>  
1     1 a      
2     2 unknown
3     0 b      

More on replace_na() here

Mammals and How They Sleep

The mammals sleep data set (msleep)

msleep
# A tibble: 83 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
 9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Today’s Variables of Interest

Variable      Description
name          Common name of mammal
vore          Carni-, insecti-, herbi- or omnivore
brainwt       brain weight in kilograms
bodywt        body weight in kilograms
sleep_rem     REM sleep, in hours
sleep_total   total amount of sleep, in hours
awake         (outcome) time spent awake, in hours

Modeling Plan

Note

Note that awake = 24 - sleep_total.

We want to predict awake using four potential predictors:

  • the type of food the mammal consumes (vore)
  • the weight of the mammal’s brain (brainwt)
  • the weight of the mammal’s body (bodywt), and
  • the proportion of total time asleep spent in REM sleep (rem_prop).

Creating the ms431 data

ms431 <- msleep |>
  mutate(rem_prop = sleep_rem / sleep_total) |>
  mutate(vore = factor(vore)) |>
  select(name, vore, brainwt, bodywt, rem_prop, awake)

glimpse(ms431)
Rows: 83
Columns: 6
$ name     <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater short-ta…
$ vore     <fct> carni, omni, herbi, omni, herbi, herbi, carni, NA, carni, her…
$ brainwt  <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0.098…
$ bodywt   <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.045, 1…
$ rem_prop <dbl> NA, 0.10588235, 0.16666667, 0.15436242, 0.17500000, 0.1527777…
$ awake    <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 18.7,…

Exploring our factor variable

ms431 |> tabyl(vore) |> adorn_pct_formatting() |> gt()
vore       n  percent  valid_percent
carni     19    22.9%          25.0%
herbi     32    38.6%          42.1%
insecti    5     6.0%           6.6%
omni      20    24.1%          26.3%
NA         7     8.4%              -

Any concerns here?

Collapse to three vore groups?

favstats(awake ~ vore, data = ms431) |>
  gt() |> fmt_number(columns = min:sd, decimals = 3)
vore       min      Q1  median      Q3     max    mean     sd   n  missing
carni    4.600  11.000  13.600  17.750  21.350  13.626  4.677  19        0
herbi    7.400   9.775  13.700  19.700  22.100  14.491  4.879  32        0
insecti  4.100   4.300   5.900  15.400  15.600   9.060  5.921   5        0
omni     6.000  13.075  14.100  14.900  16.000  13.075  2.949  20        0
ggplot(ms431, aes(x = awake, y = vore)) + geom_boxplot()
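One way to collapse (an assumption — the slides don't fix the grouping) is to fold the small insecti group (n = 5) in with omni; a base-R sketch on the observed counts:

```r
# Observed vore values (19 carni, 32 herbi, 5 insecti, 20 omni)
vore <- c(rep("carni", 19), rep("herbi", 32),
          rep("insecti", 5), rep("omni", 20))

# Hypothetical 3-group version: pool insecti with omni
vore3 <- factor(ifelse(vore == "insecti", "omni", vore))

table(vore3)  # carni 19, herbi 32, omni 25
```

forcats::fct_collapse() (or fct_lump_min()) offers a tidier route for factors inside a mutate() step.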

Exploring our quantities

df_stats(~ brainwt + bodywt + rem_prop + awake, data = ms431) |>
  gt() |> fmt_number(columns = min:sd, decimals = 3)
response    min      Q1  median      Q3        max     mean       sd   n  missing
brainwt   0.000   0.003   0.012   0.126      5.712    0.282    0.976  56       27
bodywt    0.005   0.174   1.670  41.750  6,654.000  166.136  786.840  83        0
rem_prop  0.037   0.113   0.156   0.227      0.347    0.174    0.072  61       22
awake     4.100  10.250  13.900  16.150     22.100   13.567    4.452  83        0

Any concerns here?

A brainwt of 0?

which.min(ms431$brainwt)
[1] 17
slice(ms431, 17)
# A tibble: 1 × 6
  name                      vore  brainwt bodywt rem_prop awake
  <chr>                     <fct>   <dbl>  <dbl>    <dbl> <dbl>
1 Lesser short-tailed shrew omni  0.00014  0.005    0.154  14.9
  • How many of these mammals have a brainwt below 0.01 kg?
ms431 |> filter(brainwt < 0.01) |> nrow()
[1] 23
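Note that filter() drops rows where the condition evaluates to NA, so the mammals with missing brainwt are excluded from this count. A base-R sketch of the same NA-aware logic, on toy values:

```r
x <- c(0.001, 0.02, NA, 0.005)

# NA values are excluded from the count, just as filter() drops them
sum(x < 0.01, na.rm = TRUE)   # 2
```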

Collinearity Check

vif(lm(awake ~ vore + bodywt + brainwt + rem_prop, data = ms431))
             GVIF Df GVIF^(1/(2*Df))
vore     1.571683  3        1.078270
bodywt   1.658313  1        1.287755
brainwt  1.459939  1        1.208280
rem_prop 1.310584  1        1.144807
cor(ms431$bodywt, ms431$brainwt, use = "complete.obs")
[1] 0.9337822
  • What can we do about this?

Change our set of variables?

What if we included

  • brain_prop: brain weight as a proportion of body weight

along with bodywt in our model?

ms431 <- ms431 |>
  mutate(brain_prop = brainwt / bodywt) |>
  select(name, vore, bodywt, brain_prop, rem_prop, awake)

vif(lm(awake ~ vore + bodywt + brain_prop + rem_prop, data = ms431))
               GVIF Df GVIF^(1/(2*Df))
vore       1.685346  3        1.090891
bodywt     1.277174  1        1.130121
brain_prop 1.330301  1        1.153387
rem_prop   1.346185  1        1.160252

OK. Let’s move on to think about missingness.

How much missingness do we have?

miss_var_summary(ms431)
# A tibble: 6 × 3
  variable   n_miss pct_miss
  <chr>       <int>    <num>
1 brain_prop     27    32.5 
2 rem_prop       22    26.5 
3 vore            7     8.43
4 name            0     0   
5 bodywt          0     0   
6 awake           0     0   
miss_case_table(ms431)
# A tibble: 4 × 3
  n_miss_in_case n_cases pct_cases
           <int>   <int>     <dbl>
1              0      43     51.8 
2              1      26     31.3 
3              2      12     14.5 
4              3       2      2.41

Missing Data Mechanisms

  • Missing completely at random There are no systematic differences between the missing values and the observed values.
    • For example, blood pressure measurements may be missing because of breakdown of an automatic sphygmomanometer.
  • Missing at random Any systematic difference between the missing and observed values can be explained by other observed data.
    • For example, missing BP measurements may be lower than measured BPs but only because younger people more often have a missing BP.
  • Missing not at random Even after the observed data are taken into account, systematic differences remain between the missing values and the observed values.
    • For example, people with high BP may be more likely to have headaches that cause them to miss clinic appointments.

“Missing at random” is an assumption that justifies the analysis, rather than a property of the data.

What assumption should we use?

Can we assume the data are MCAR, per Little’s test?

mcar_test(ms431)
# A tibble: 1 × 4
  statistic    df p.value missing.patterns
      <dbl> <dbl>   <dbl>            <int>
1      33.5    22  0.0549                6

The \(p\) value for the \(\chi^2\) test statistic is 0.0549 — just above the usual 0.05 cutoff, but small enough that we are reluctant to assume the ms431 data are MCAR.

Reference

Little, Roderick J. A. 1988. “A Test of Missing Completely at Random for Multivariate Data with Missing Values.” Journal of the American Statistical Association 83 (404): 1198–1202. doi:10.1080/01621459.1988.10478722.

If not MCAR, then what?

Suppose we assume that the data are MAR. This suggests the need for imputation of missing values.

Note

If we were willing to assume MCAR, we could simply do a complete case analysis.

Here, we have complete data on our outcome (awake) and bodywt so we won’t need to impute them.

  • We will need to impute vore (7), brain_prop (27) and rem_prop (22) from our sample of 83 mammals.
  • vore is a factor, the others are quantities.

Single Imputation

We’ll create a singly imputed data set first, to select our predictors.

set.seed(12345)
ms431_imp1 <- mice(ms431, m = 1, printFlag = FALSE)
ms431_imp <- complete(ms431_imp1)

prop_complete(ms431); prop_complete(ms431_imp)
[1] 0.8875502
[1] 1

Note

After we’ve settled on a final prediction model using ms431_imp, we’ll implement multiple imputation.

Range Check for impossible values?

ms431_imp |> tabyl(vore) |> adorn_pct_formatting() |> gt()
vore      n  percent
carni    19    22.9%
herbi    36    43.4%
insecti   5     6.0%
omni     23    27.7%
df_stats(~ brain_prop + bodywt + rem_prop + awake, data = ms431_imp) |>
  gt() |> fmt_number(columns = min:sd, decimals = 3)
response      min      Q1  median      Q3        max     mean       sd   n  missing
brain_prop  0.001   0.003   0.008   0.017      0.040    0.011    0.009  83        0
bodywt      0.005   0.174   1.670  41.750  6,654.000  166.136  786.840  83        0
rem_prop    0.037   0.113   0.158   0.237      0.347    0.175    0.071  83        0
awake       4.100  10.250  13.900  16.150     22.100   13.567    4.452  83        0

Partitioning the ms431_imp data

  • Do we have a unique name to identify each mammal?
n_distinct(ms431_imp$name) == nrow(ms431_imp) # compare numbers
[1] TRUE
near(n_distinct(ms431_imp$name), nrow(ms431_imp)) # tidyverse approach
[1] TRUE
  • Partition 70% into training sample, remaining 30% to test, while maintaining a similar percentage by vore groups.
set.seed(20231128)
ms431_train <- slice_sample(ms431_imp, prop = 0.7, by = "vore")
ms431_test <- anti_join(ms431_imp, ms431_train, by = "name")

dim(ms431_train); dim(ms431_test)
[1] 57  6
[1] 26  6

Outcome transformation?

boxCox(lm(awake ~ vore + bodywt + brain_prop + rem_prop, data = ms431_train))

Scatterplot Matrix

ggpairs(ms431_train |> select(vore, bodywt, brain_prop, rem_prop, awake))

Collinearity Check in Training Sample

vif(lm(awake ~ vore + bodywt + brain_prop + rem_prop, data = ms431_train))
               GVIF Df GVIF^(1/(2*Df))
vore       1.725015  3        1.095129
bodywt     1.056919  1        1.028066
brain_prop 1.207572  1        1.098896
rem_prop   1.531482  1        1.237531

Which Potential Models Will We Fit?

  • Model 1: A simple regression on bodywt
  • Model 2: A simple regression on brain_prop
  • Model 3: A model with the two size variables (brain_prop and bodywt)
  • Model 4: Model 3 + vore
  • Model 5: Model 3 + rem_prop
  • Model 6: All four predictors

Model 6

m6 <- lm(awake ~ bodywt + brain_prop + vore + rem_prop, data = ms431_train)
tidy(m6, conf.int = T, conf.level = 0.90) |> gt()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 12.861509015 2.537865e+00 5.0678454 5.878093e-06 8.608288e+00 17.114730227
bodywt 0.001204781 5.970149e-04 2.0180077 4.897220e-02 2.042404e-04 0.002205321
brain_prop -83.198697107 6.012258e+01 -1.3838179 1.725609e-01 -1.839584e+02 17.561029934
voreherbi 1.442092605 1.601502e+00 0.9004628 3.721908e-01 -1.241872e+00 4.126057298
voreinsecti -7.614590737 2.500228e+00 -3.0455581 3.700684e-03 -1.180474e+01 -3.424445527
voreomni 0.753316104 1.637686e+00 0.4599881 6.475188e-01 -1.991290e+00 3.497922240
rem_prop 2.776453501 9.496943e+00 0.2923523 7.712270e-01 -1.313952e+01 18.692428736
glance(m6) |> select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs,
                     df, df.residual, everything()) |> gt()
r.squared adj.r.squared sigma AIC BIC nobs df df.residual statistic p.value logLik deviance
0.322219 0.2408853 3.851854 324.0256 340.37 57 6 50 3.961691 0.002562502 -154.0128 741.839

Model 5

m5 <- lm(awake ~ bodywt + brain_prop + rem_prop, data = ms431_train)
tidy(m5, conf.int = T, conf.level = 0.90) |> gt()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.507494e+01 1.850906e+00 8.1446283 6.578536e-11 1.197631e+01 18.17357317
bodywt 1.359156e-03 6.426695e-04 2.1148606 3.916124e-02 2.832530e-04 0.00243506
brain_prop -1.022174e+02 6.214353e+01 -1.6448602 1.059177e-01 -2.062529e+02 1.81807281
rem_prop -6.547216e+00 8.630335e+00 -0.7586282 4.514356e-01 -2.099540e+01 7.90096864
glance(m5) |> select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs,
                     df, df.residual, everything()) |> gt()
r.squared adj.r.squared sigma AIC BIC nobs df df.residual statistic p.value logLik deviance
0.1380569 0.08926767 4.219019 331.7264 341.9417 57 3 53 2.829659 0.0471244 -160.8632 943.4065

Model 4

m4 <- lm(awake ~ bodywt + brain_prop + vore, data = ms431_train)
tidy(m4, conf.int = T, conf.level = 0.90) |> gt()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 13.524618518 1.128207e+00 11.9877137 1.875095e-16 1.163455e+01 15.414686173
bodywt 0.001217627 5.900331e-04 2.0636597 4.415652e-02 2.291540e-04 0.002206101
brain_prop -86.220983537 5.869370e+01 -1.4689989 1.479748e-01 -1.845497e+02 12.107694999
voreherbi 1.188933544 1.335070e+00 0.8905400 3.773579e-01 -1.047690e+00 3.425556767
voreinsecti -7.650136315 2.474779e+00 -3.0912408 3.226971e-03 -1.179610e+01 -3.504177125
voreomni 0.587367948 1.522332e+00 0.3858343 7.012242e-01 -1.962972e+00 3.137707806
glance(m4) |> select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs,
                     df, df.residual, everything()) |> gt()
r.squared adj.r.squared sigma AIC BIC nobs df df.residual statistic p.value logLik deviance
0.3210604 0.2544977 3.817162 322.123 336.4243 57 5 51 4.823428 0.001095717 -154.0615 743.1071

Model 3

m3 <- lm(awake ~ bodywt + brain_prop, data = ms431_train)
tidy(m3, conf.int = T, conf.level = 0.90) |> gt()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 13.841315609 8.805796e-01 15.718415 8.426394e-22 1.236761e+01 15.315022680
bodywt 0.001372401 6.399023e-04 2.144704 3.649326e-02 3.014830e-04 0.002443319
brain_prop -91.532194409 6.028794e+01 -1.518251 1.347846e-01 -1.924280e+02 9.363581154
glance(m3) |> select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs,
                     df, df.residual, everything()) |> gt()
r.squared adj.r.squared sigma AIC BIC nobs df df.residual statistic p.value logLik deviance
0.1286972 0.09642676 4.202404 330.342 338.5142 57 2 54 3.98808 0.02424184 -161.171 953.6508

Model 2

m2 <- lm(awake ~ brain_prop, data = ms431_train)
tidy(m2, conf.int = T, conf.level = 0.90) |> gt()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 14.26293 0.8860011 16.098091 1.814412e-22 12.78062 15.745236
brain_prop -109.68223 61.6134523 -1.780167 8.057277e-02 -212.76363 -6.600831
glance(m2) |> select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs,
                     df, df.residual, everything()) |> gt()
r.squared adj.r.squared sigma AIC BIC nobs df df.residual statistic p.value logLik deviance
0.0544791 0.03728781 4.337748 333.0016 339.1307 57 1 55 3.168994 0.08057277 -163.5008 1034.883

Model 1

m1 <- lm(awake ~ bodywt, data = ms431_train)
tidy(m1, conf.int = T, conf.level = 0.90) |> gt()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 12.817198699 0.5727327039 22.379024 2.665922e-29 1.18590e+01 13.775399966
bodywt 0.001508777 0.0006410394 2.353641 2.219168e-02 4.36296e-04 0.002581257
glance(m1) |> select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs,
                     df, df.residual, everything()) |> gt()
r.squared adj.r.squared sigma AIC BIC nobs df df.residual statistic p.value logLik deviance
0.09150414 0.07498604 4.251971 330.7247 336.8538 57 1 55 5.539627 0.02219168 -162.3623 994.3591

Combining glance() results

bind_rows(glance(m1) |> mutate(name = "m1"),
          glance(m2) |> mutate(name = "m2"),
          glance(m3) |> mutate(name = "m3"),
          glance(m4) |> mutate(name = "m4"),
          glance(m5) |> mutate(name = "m5"),
          glance(m6) |> mutate(name = "m6")) |>
  select(name, r.squared, adj.r.squared, sigma, AIC, BIC, nobs, df) |>
  gt() |> fmt_number(decimals = 5, columns = 2) |>
    fmt_number(decimals = 4, columns = c(3:4)) |>
    fmt_number(decimals = 0, columns = c(5:6)) 
name  r.squared  adj.r.squared   sigma  AIC  BIC  nobs  df
m1      0.09150         0.0750  4.2520  331  337    57   1
m2      0.05448         0.0373  4.3377  333  339    57   1
m3      0.12870         0.0964  4.2024  330  339    57   2
m4      0.32106         0.2545  3.8172  322  336    57   5
m5      0.13806         0.0893  4.2190  332  342    57   3
m6      0.32222         0.2409  3.8519  324  340    57   6

Move forward with m3, m4 and m5

test_m3 <- augment(m3, newdata = ms431_test) |> mutate(mod = "m3")
test_m4 <- augment(m4, newdata = ms431_test) |> mutate(mod = "m4")
test_m5 <- augment(m5, newdata = ms431_test) |> mutate(mod = "m5")

test_comp <- bind_rows(test_m3, test_m4, test_m5) |>
  arrange(name, mod)

Comparing Models: Test Sample

test_comp |>
  group_by(mod) |>
  summarize(n = n(),
            MAPE = mean(abs(.resid)), 
            RMSPE = sqrt(mean(.resid^2)),
            max_error = max(abs(.resid)),
            median_APE = median(abs(.resid)),
            valid_R2 = cor(awake, .fitted)^2) |>
  gt() |> fmt_number(decimals = 4, columns = -"n") 
mod   n    MAPE   RMSPE  max_error  median_APE  valid_R2
m3   26  3.6721  4.2655     7.9363      3.4188    0.1693
m4   26  4.1504  5.0685    11.1050      3.4314    0.0163
m5   26  3.6081  4.1256     7.1338      3.4065    0.2558

We’ll pick model m5.

Model 5 (training sample)

m5 <- lm(awake ~ bodywt + brain_prop + rem_prop, data = ms431_train)
tidy(m5, conf.int = T, conf.level = 0.90) |> gt()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.507494e+01 1.850906e+00 8.1446283 6.578536e-11 1.197631e+01 18.17357317
bodywt 1.359156e-03 6.426695e-04 2.1148606 3.916124e-02 2.832530e-04 0.00243506
brain_prop -1.022174e+02 6.214353e+01 -1.6448602 1.059177e-01 -2.062529e+02 1.81807281
rem_prop -6.547216e+00 8.630335e+00 -0.7586282 4.514356e-01 -2.099540e+01 7.90096864
glance(m5) |> select(r.squared, adj.r.squared, sigma, AIC, BIC, nobs,
                     df, df.residual, everything()) |> gt()
r.squared adj.r.squared sigma AIC BIC nobs df df.residual statistic p.value logLik deviance
0.1380569 0.08926767 4.219019 331.7264 341.9417 57 3 53 2.829659 0.0471244 -160.8632 943.4065

Create Multiple Imputations

How many subjects have missing data that affect this model?

ms431_sub <- ms431 |> select(name, awake, bodywt, brain_prop, rem_prop)

pct_miss_case(ms431_sub)
[1] 42.16867
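pct_miss_case() is the percentage of rows with at least one NA; a base-R sketch of the same quantity, on a toy frame:

```r
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

# complete.cases() flags rows with no missing values
100 * mean(!complete.cases(toy))   # 66.67: two of three rows have some NA
```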

We’ll build 50 imputed data sets.

set.seed(4312345)
ms431_mice <- mice(ms431, m = 50, printFlag = FALSE)

Summary of Imputation Process

summary(ms431_mice)
Class: mids
Number of multiple imputations:  50 
Imputation methods:
      name       vore     bodywt brain_prop   rem_prop      awake 
        ""  "polyreg"         ""      "pmm"      "pmm"         "" 
PredictorMatrix:
           name vore bodywt brain_prop rem_prop awake
name          0    1      1          1        1     1
vore          0    0      1          1        1     1
bodywt        0    1      0          1        1     1
brain_prop    0    1      1          0        1     1
rem_prop      0    1      1          1        0     1
awake         0    1      1          1        1     0
Number of logged events:  425 
  it im      dep     meth        out
1  0  0          constant       name
2  1  1 rem_prop      pmm brain_prop
3  1  2     vore  polyreg brain_prop
4  1  2 rem_prop      pmm brain_prop
5  1  3     vore  polyreg brain_prop
6  1  3 rem_prop      pmm brain_prop

Run Model 5 on each imputation

m5_mods <- with(ms431_mice, lm(awake ~ bodywt + brain_prop + rem_prop))

summary(m5_mods)
# A tibble: 200 × 6
   term          estimate std.error statistic  p.value  nobs
   <chr>            <dbl>     <dbl>     <dbl>    <dbl> <int>
 1 (Intercept)   16.4      1.36         12.1  1.27e-19    83
 2 bodywt         0.00184  0.000604      3.04 3.21e- 3    83
 3 brain_prop   -94.1     51.8          -1.82 7.29e- 2    83
 4 rem_prop     -12.7      6.68         -1.90 6.11e- 2    83
 5 (Intercept)   16.6      1.34         12.4  2.74e-20    83
 6 bodywt         0.00133  0.000590      2.25 2.70e- 2    83
 7 brain_prop  -132.      50.3          -2.62 1.05e- 2    83
 8 rem_prop     -10.9      6.13         -1.77 8.04e- 2    83
 9 (Intercept)   17.2      1.28         13.5  3.88e-22    83
10 bodywt         0.00152  0.000577      2.64 1.01e- 2    83
# ℹ 190 more rows

Pool Results across imputations

m5_pool <- pool(m5_mods)
summary(m5_pool, conf.int = TRUE, conf.level = 0.90)
         term     estimate    std.error statistic       df      p.value
1 (Intercept)  16.41952438 1.565542e+00 10.488075 55.55978 8.701785e-15
2      bodywt   0.00161955 6.275382e-04  2.580799 68.82462 1.199259e-02
3  brain_prop -91.25426722 5.961674e+01 -1.530682 54.92329 1.315879e-01
4    rem_prop -12.21467462 7.154355e+00 -1.707306 56.77613 9.322886e-02
            5 %         95 %
1  1.380077e+01 19.038277745
2  5.732576e-04  0.002665842
3 -1.909975e+02  8.488943366
4 -2.417774e+01 -0.251607082

Estimate R-square and Adjusted R-square

pool.r.squared(m5_mods)
          est      lo 95     hi 95       fmi
R^2 0.1687734 0.03999556 0.3425666 0.1226275
pool.r.squared(m5_mods, adjusted = TRUE)
              est      lo 95     hi 95       fmi
adj R^2 0.1368756 0.02252933 0.3079875 0.1468647

Residuals for mod5 (20th imputation)

par(mfrow = c(1,2)); plot(m5_mods$analyses[[20]], which = c(1:2))

Residuals for mod5 (20th imputation)

par(mfrow = c(1,2)); plot(m5_mods$analyses[[20]], which = c(3,5)); par(mfrow = c(1,1))

Which animal is that?

ms431 |> slice(36)
# A tibble: 1 × 6
  name             vore  bodywt brain_prop rem_prop awake
  <chr>            <fct>  <dbl>      <dbl>    <dbl> <dbl>
1 African elephant herbi   6654   0.000858       NA  20.7

Which are the heavy mammals?

ms431 |> filter(bodywt > 500)
# A tibble: 6 × 6
  name             vore  bodywt brain_prop rem_prop awake
  <chr>            <fct>  <dbl>      <dbl>    <dbl> <dbl>
1 Cow              herbi   600    0.000705   0.175   20  
2 Asian elephant   herbi  2547    0.00181   NA       20.1
3 Horse            herbi   521    0.00126    0.207   21.1
4 Giraffe          herbi   900.  NA          0.211   22.1
5 Pilot whale      carni   800   NA          0.0370  21.4
6 African elephant herbi  6654    0.000858  NA       20.7

More Details on MI modeling

m5_pool
Class: mipo    m = 50 
         term  m     estimate         ubar            b            t dfcom
1 (Intercept) 50  16.41952438 1.881836e+00 5.579277e-01 2.450923e+00    79
2      bodywt 50   0.00161955 3.562166e-07 3.685064e-08 3.938042e-07    79
3  brain_prop 50 -91.25426722 2.705601e+03 8.319159e+02 3.554155e+03    79
4    rem_prop 50 -12.21467462 3.993923e+01 1.102507e+01 5.118480e+01    79
        df       riv     lambda       fmi
1 55.55978 0.3024101 0.23219267 0.2584157
2 68.82462 0.1055191 0.09544756 0.1206354
3 54.92329 0.3136287 0.23874989 0.2650347
4 56.77613 0.2815670 0.21970523 0.2458125
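The t, riv, and lambda columns follow Rubin's rules: the total variance t is the within-imputation variance ubar plus (1 + 1/m) times the between-imputation variance b. A base-R check against the intercept row above (inputs copied from the displayed output, so agreement is up to display rounding):

```r
m    <- 50          # number of imputations
ubar <- 1.881836    # within-imputation variance (intercept row)
b    <- 0.5579277   # between-imputation variance (intercept row)

t_total <- ubar + (1 + 1/m) * b     # total variance
riv     <- (1 + 1/m) * b / ubar     # relative increase in variance
lambda  <- (1 + 1/m) * b / t_total  # share of variance due to missingness

round(c(t = t_total, riv = riv, lambda = lambda), 6)
# agrees with the mipo output above: t ~ 2.450923, riv ~ 0.302410, lambda ~ 0.232193
```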

Guidelines for reporting, I (Sterne)

How should we report on analyses potentially affected by missing data?

  • Report the number of missing values for each variable of interest, or the number of cases with complete data for each important component of the analysis. Give reasons for missing values if possible, and indicate how many individuals were excluded because of missing data when reporting the flow of participants through the study. If possible, describe reasons for missing data in terms of other variables (rather than just reporting a universal reason such as treatment failure).
  • Clarify whether there are important differences between individuals with complete and incomplete data, for example, by providing a table comparing the distributions of key exposure and outcome variables in these different groups.
  • Describe the type of analysis used to account for missing data (e.g., multiple imputation), and the assumptions that were made (e.g., missing at random).

Guidelines for reporting, II (Sterne)

How should we report on analyses that involve multiple imputation?

  • Provide details of the imputation modeling (software used, key settings, number of imputed datasets, variables included in imputation procedure, etc.)
  • If a large fraction of the data is imputed, compare observed and imputed values.
  • Where possible, provide results from analyses restricted to complete cases, for comparison with results based on multiple imputation. If there are important differences between the results, suggest explanations.
  • It is also desirable to investigate the robustness of key inferences to possible departures from the missing at random assumption, by assuming a range of missing not at random mechanisms in sensitivity analyses.

Session Information

xfun::session_info()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  abind_1.4-8          askpass_1.2.1        backports_1.5.0     
  base64enc_0.1.3      bayestestR_0.15.0    bigD_0.3.0          
  bit_4.5.0            bit64_4.5.2          bitops_1.0.9        
  blob_1.2.4           boot_1.3-31          broom_1.0.7         
  bslib_0.8.0          cachem_1.1.0         callr_3.7.6         
  car_3.1-3            carData_3.0-5        cellranger_1.1.0    
  cli_3.6.3            clipr_0.8.0          coda_0.19-4.1       
  codetools_0.2-20     colorspace_2.1-1     commonmark_1.9.2    
  compiler_4.4.1       conflicted_1.2.0     correlation_0.8.6   
  cowplot_1.1.3        cpp11_0.5.0          crayon_1.5.3        
  curl_6.0.0           data.table_1.16.2    datasets_4.4.1      
  datawizard_0.13.0    DBI_1.2.3            dbplyr_2.5.0        
  Deriv_4.1.6          digest_0.6.37        doBy_4.6.24         
  dplyr_1.1.4          dtplyr_1.3.1         easystats_0.7.3     
  effectsize_0.8.9     emmeans_1.10.5       estimability_1.5.1  
  evaluate_1.0.1       fansi_1.0.6          farver_2.1.2        
  fastmap_1.2.0        fontawesome_0.5.2    forcats_1.0.0       
  foreach_1.5.2        Formula_1.2-5        fs_1.6.5            
  gargle_1.5.2         generics_0.1.3       GGally_2.2.1        
  ggformula_0.12.0     ggplot2_3.5.1        ggridges_0.5.6      
  ggstats_0.7.0        glmnet_4.1-8         glue_1.8.0          
  googledrive_2.1.1    googlesheets4_1.1.1  graphics_4.4.1      
  grDevices_4.4.1      grid_4.4.1           gridExtra_2.3       
  gt_0.11.1            gtable_0.3.6         haven_2.5.4         
  here_1.0.1           highr_0.11           hms_1.1.3           
  htmltools_0.5.8.1    htmlwidgets_1.6.4    httr_1.4.7          
  ids_1.0.1            insight_0.20.5       isoband_0.2.7       
  iterators_1.0.14     janitor_2.2.0        jomo_2.7-6          
  jquerylib_0.1.4      jsonlite_1.8.9       juicyjuice_0.1.0    
  knitr_1.49           labeling_0.4.3       labelled_2.13.0     
  lattice_0.22-6       lifecycle_1.0.4      lme4_1.1-35.5       
  lubridate_1.9.3      magrittr_2.0.3       markdown_1.13       
  MASS_7.3-61          Matrix_1.7-0         MatrixModels_0.5.3  
  memoise_2.0.1        methods_4.4.1        mgcv_1.9.1          
  mice_3.16.0          microbenchmark_1.5.0 mime_0.12           
  minqa_1.2.8          mitml_0.4-5          modelbased_0.8.9    
  modelr_0.1.11        mosaic_1.9.1         mosaicCore_0.9.4.0  
  mosaicData_0.20.4    multcomp_1.4-26      munsell_0.5.1       
  mvtnorm_1.3-1        naniar_1.1.0         nlme_3.1-164        
  nloptr_2.1.1         nnet_7.3-19          norm_1.0-11.1       
  numDeriv_2016.8.1.1  openssl_2.2.2        ordinal_2023.12.4.1 
  pan_1.9              parallel_4.4.1       parameters_0.23.0   
  patchwork_1.3.0      pbkrtest_0.5.3       performance_0.12.4  
  pillar_1.9.0         pkgconfig_2.0.3      plyr_1.8.9          
  prettyunits_1.2.0    processx_3.8.4       progress_1.2.3      
  ps_1.8.1             purrr_1.0.2          quantreg_5.99       
  R6_2.5.1             ragg_1.3.3           rappdirs_0.3.3      
  RColorBrewer_1.1-3   Rcpp_1.0.13-1        RcppEigen_0.3.4.0.2 
  reactable_0.4.4      reactR_0.6.1         readr_2.1.5         
  readxl_1.4.3         rematch_2.0.0        rematch2_2.1.2      
  report_0.5.9         reprex_2.1.1         rlang_1.1.4         
  rmarkdown_2.29       rpart_4.1.23         rprojroot_2.0.4     
  rstudioapi_0.17.1    rvest_1.0.4          sandwich_3.1-1      
  sass_0.4.9           scales_1.3.0         see_0.9.0           
  selectr_0.4.2        shape_1.4.6.1        snakecase_0.11.1    
  SparseM_1.84.2       splines_4.4.1        stats_4.4.1         
  stringi_1.8.4        stringr_1.5.1        survival_3.7-0      
  sys_3.4.3            systemfonts_1.1.0    textshaping_0.4.0   
  TH.data_1.1-2        tibble_3.2.1         tidyr_1.3.1         
  tidyselect_1.2.1     tidyverse_2.0.0      timechange_0.3.0    
  tinytex_0.54         tools_4.4.1          tzdb_0.4.0          
  ucminf_1.2.2         UpSetR_1.4.0         utf8_1.2.4          
  utils_4.4.1          uuid_1.2.1           V8_6.0.0            
  vctrs_0.6.5          viridis_0.6.5        viridisLite_0.4.2   
  visdat_0.6.0         vroom_1.6.5          withr_3.0.2         
  xfun_0.48            xml2_1.3.6           xtable_1.8-4        
  yaml_2.3.10          zoo_1.8-12